
[SPARK-21472][SQL] Introduce ArrowColumnVector as a reader for Arrow vectors. #18680

Closed
ueshin wants to merge 8 commits into apache:master from ueshin:issues/SPARK-21472

Conversation

Member

@ueshin ueshin commented Jul 19, 2017

What changes were proposed in this pull request?

Introducing ArrowColumnVector as a reader for Arrow vectors.
It extends ColumnVector, so we will be able to use it with ColumnarBatch and its functionalities.
Currently it supports primitive types and StringType, ArrayType and StructType.

How was this patch tested?

Added tests for ArrowColumnVector and existing tests.

Member Author

ueshin commented Jul 19, 2017

cc @BryanCutler @kiszk @cloud-fan

```java
@Override
public boolean[] getBooleans(int rowId, int count) {
  assert(dictionary == null);
  NullableBitVector.Accessor accessor = boolData.getAccessor();
```
Member

@kiszk kiszk Jul 19, 2017


Can we use nulls? Ditto for other places.

Member Author

I'm afraid not, because the type of nulls is ValueVector.Accessor, which has only simple methods such as isNull().
The concrete accessor APIs are different for each type.
Or should we cast nulls to the concrete type each time?

Member

I see. Can we keep NullableBitVector.Accessor instead of NullableBitVector, while keeping the same reference in two instance variables? I am concerned about the cost of the runtime cast in the getBoolean() method rather than in getBooleans().
This way I expect the get() method to be inlined by the JIT compiler, since each Accessor class is final.
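The pattern described here (holding the concrete accessor in an instance field so the hot per-row read path needs no runtime cast) can be sketched without Arrow as follows; BitAccessor below is a stand-in for Arrow's NullableBitVector.Accessor, not the real class, and all names are illustrative:

```java
// Hedged sketch: cache the concrete (final) accessor in a field so the
// per-row getBoolean() path needs no runtime cast, and the final get()
// call stays monomorphic and inlinable by the JIT.
public class AccessorSketch {
    static final class BitAccessor { // stand-in for NullableBitVector.Accessor
        private final int[] bits;
        BitAccessor(int[] bits) { this.bits = bits; }
        int get(int rowId) { return bits[rowId]; } // final class: inlinable
    }

    private final BitAccessor accessor; // concrete type, no cast per call

    AccessorSketch(int[] bits) { this.accessor = new BitAccessor(bits); }

    public boolean getBoolean(int rowId) {
        return accessor.get(rowId) == 1;
    }

    public static void main(String[] args) {
        AccessorSketch v = new AccessorSketch(new int[]{1, 0, 1});
        System.out.println(v.getBoolean(0)); // true
    }
}
```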


```java
@Override
public boolean getBoolean(int rowId) {
  return boolData.getAccessor().get(rowId) == 1;
```
Member

Can we use nulls? Ditto for other places.

@SparkQA

SparkQA commented Jul 19, 2017

Test build #79752 has finished for PR 18680 at commit 73899b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class ArrowColumnVector extends ColumnVector

```java
 */
public abstract class ReadOnlyColumnVector extends ColumnVector {

  protected ReadOnlyColumnVector(int capacity, MemoryMode memMode) {
```
Member

@kiszk kiszk Jul 19, 2017


Is there any reason not to accept dataType as one of the arguments? Having the argument would be more flexible for future usages.

Member Author

I see, I'll modify it to accept dataType, but I guess we shouldn't pass it to ColumnVector, to avoid illegally allocating child columns.

@SparkQA

SparkQA commented Jul 19, 2017

Test build #79763 has finished for PR 18680 at commit ddfcf36.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@BryanCutler BryanCutler left a comment

Thanks @ueshin for this. I made a first pass; I see a lot of things are scoped to public - is this intended to be a public API?

```scala
  case _ => throw new UnsupportedOperationException(s"Unsupported data type: $dt")
}

def toArrowField(name: String, dt: DataType, nullable: Boolean): Field = {
```
Member


Is this only used for testing?

Member Author

No, this is used to create an Arrow schema from StructType in ArrowUtils.toArrowSchema(), too.


```scala
import org.apache.spark.sql.types._

object ArrowUtils {
```
Member


Shouldn't this be private[sql]? Also in other places.

```java
/**
 * A column backed by Apache Arrow.
 */
public final class ArrowColumnVector extends ReadOnlyColumnVector {
```
Member


Is this planned to be a public API right now?

```java
  }
  resultStruct = new ColumnarBatch.Row(childColumns);
} else {
  throw new UnsupportedOperationException();
```
Member


Can this whole "if else" block be put into a pattern match instead?

Member Author

Unfortunately, this class is written in Java, so we can't use a pattern match.

```java
/**
 * An abstract class for read-only column vector.
 */
public abstract class ReadOnlyColumnVector extends ColumnVector {
```
Member


Wouldn't it be better to refactor ColumnVector into classes that separate reading/writing so you could just extend the read portion instead of making this class that throws exceptions on writes? e.g.

ColumnVector -> ColumnVectorWritable -> ColumnVectorReadable
ArrowColumnVector -> ColumnVectorReadable
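A minimal sketch of the split being proposed, written with interfaces rather than the class chain above; all names are illustrative, not Spark's actual API. The point is that an Arrow-backed vector implements only the read surface, so writes become impossible by construction instead of throwing UnsupportedOperationException at runtime:

```java
// Hedged sketch of a read/write split for column vectors.
public class HierarchySketch {
    interface ReadableColumnVector {
        boolean getBoolean(int rowId);
    }

    interface WritableColumnVector extends ReadableColumnVector {
        void putBoolean(int rowId, boolean value);
    }

    // An Arrow-backed vector implements only the readable side.
    static final class ArrowBackedVector implements ReadableColumnVector {
        private final boolean[] data; // stand-in for an Arrow buffer
        ArrowBackedVector(boolean[] data) { this.data = data; }
        public boolean getBoolean(int rowId) { return data[rowId]; }
    }

    public static void main(String[] args) {
        ReadableColumnVector v = new ArrowBackedVector(new boolean[]{true, false});
        System.out.println(v.getBoolean(0)); // true
    }
}
```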

Member Author

I agree that it'd be better to refactor ColumnVector, but ColumnVector is tied to ColumnarBatch and other classes, so we should do that, and also refactor ColumnarBatch at the same time, in future PRs.

Contributor

+1 on separating the read/write, we should definitely do this before we publish the ColumnVector interfaces.

Member Author

ueshin commented Jul 20, 2017

@BryanCutler Thank you for reviewing!
As for scope, yes, I'd like these APIs to be public. Do you have any concerns about it?

@SparkQA

SparkQA commented Jul 20, 2017

Test build #79787 has finished for PR 18680 at commit 91b94ef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

@BryanCutler all classes under the execution package are meant to be private; in the future we will move them to a new package once we are ready to make them public.

```java
import org.apache.spark.unsafe.types.UTF8String;

/**
 * A column backed by Apache Arrow.
```
Contributor


nit: a column vector

```java
public boolean[] getBooleans(int rowId, int count) {
  boolean[] array = new boolean[count];
  for (int i = 0; i < count; ++i) {
    array[i] = accessor.getBoolean(rowId + i);
```
Contributor


we don't need to address this now, but do we have a better implementation with Arrow? cc @BryanCutler

Contributor

@cloud-fan cloud-fan Jul 20, 2017

kind of a batch read API.

Member

I checked Arrow's API docs. I didn't find a batch read API.

```java
childColumns = new ColumnVector[1];
childColumns[0] = new ArrowColumnVector(listVector.getDataVector());
resultArray = new Array(childColumns[0]);
} else if (vector instanceof MapVector) {
```
Contributor


An unrelated question: why is a vector for struct type called MapVector in Arrow? cc @BryanCutler

Member

I'm not sure about the design decision behind it, but it's meant to look up child vectors by name, so it uses a kind of hash map. I agree that another name would have been more intuitive.


```java
@Override
final int getArrayLength(int rowId) {
  return accessor.get(rowId + 1) - accessor.get(rowId);
```
Member


If the given rowId is the last row, is it still valid to call get(rowId + 1)?

Member Author

Yes, the offset vector for ListVector should have (number of arrays + 1) values.
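The invariant above can be sketched with a plain array standing in for the offset vector (names are illustrative only): with N arrays there are N + 1 offsets, so reading offsets[rowId + 1] is valid even for the last row.

```java
// Hedged sketch of the ListVector offset-vector invariant.
public class OffsetSketch {
    // The length of array rowId is the distance between adjacent offsets.
    static int getArrayLength(int[] offsets, int rowId) {
        return offsets[rowId + 1] - offsets[rowId];
    }

    public static void main(String[] args) {
        // Three arrays packed back-to-back with lengths 2, 3, 0.
        int[] offsets = {0, 2, 5, 5}; // 3 arrays, 4 offsets
        System.out.println(getArrayLength(offsets, 0)); // 2
        System.out.println(getArrayLength(offsets, 2)); // 0 (last row, still valid)
    }
}
```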

@cloud-fan
Contributor

LGTM, pending jenkins

@SparkQA

SparkQA commented Jul 20, 2017

Test build #79793 has finished for PR 18680 at commit 2d1dad9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@rxin
Contributor

rxin commented Jul 20, 2017

Have you guys checked the performance of this change? It changes the number of concrete implementations for column vector from 2 to 3 (and potentially 1 to 2 at runtime). This might (or might not) have huge performance implications because it might disable inlining, or force virtual dispatches. (It depends on how we use ColumnVector.)

Member Author

ueshin commented Jul 21, 2017

@viirya and the original reporter, thank you for reporting it!
I submitted a follow-up pr #18701.

asfgit pushed a commit that referenced this pull request Jul 21, 2017
… for Arrow vectors.

## What changes were proposed in this pull request?

This is a follow-up of #18680.

In some environment, a compile error happens saying:

```
.../sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java:243:
error: not found: type Array
  public void loadBytes(Array array) {
                        ^
```

This pr fixes it.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes #18701 from ueshin/issues/SPARK-21472_fup1.
ghost pushed a commit to dbtsai/spark that referenced this pull request Aug 29, 2017
…ctor type.

## What changes were proposed in this pull request?

As mentioned at apache#18680 (comment), when we have more `ColumnVector` implementations, it might (or might not) have huge performance implications because it might disable inlining, or force virtual dispatches.

As for read path, one of the major paths is the one generated by `ColumnBatchScan`. Currently it refers `ColumnVector` so the penalty will be bigger as we have more classes, but we can know the concrete type from its usage, e.g. vectorized Parquet reader uses `OnHeapColumnVector`. We can use the concrete type in the generated code directly to avoid the penalty.

## How was this patch tested?

Existing tests.

Author: Takuya UESHIN <ueshin@databricks.com>

Closes apache#18989 from ueshin/issues/SPARK-21781.
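The idea in that follow-up can be illustrated with stand-in classes (all names below are hypothetical, not Spark's actual codegen output): when the generated scan loop is typed against the concrete vector class rather than the ColumnVector base class, the call site stays monomorphic and the JIT can devirtualize and inline the read.

```java
// Hedged illustration of referencing the concrete vector type in
// generated code to avoid virtual dispatch.
public class CodegenSketch {
    static final class OnHeapVector { // stand-in for OnHeapColumnVector
        private final int[] data;
        OnHeapVector(int[] data) { this.data = data; }
        int getInt(int rowId) { return data[rowId]; }
    }

    // "Generated" scan loop typed against the concrete class:
    static int sum(OnHeapVector v, int numRows) {
        int s = 0;
        for (int i = 0; i < numRows; i++) {
            s += v.getInt(i); // monomorphic call site, inlinable
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum(new OnHeapVector(new int[]{1, 2, 3}), 3)); // 6
    }
}
```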